A AfterWork 


Ensemble Learning with Python 


Learning 
Outcomes 


By the end of this topic, you will have achieved the following learning 


outcomes: 
e |can understand the concept of ensemble learning when making predictions. 
e | can use ensemble learning methods to improve the performance of models. 
e | can differentiate between different ensemble learning methods. 
e |can understand the limitations of ensemble learning methods. 
“Predicting the future isn’t magic, it’s artificial intelligence.” ~Dave Waters 
Definitions 


Ensemble Learning is the process of using multiple predictive models to solve 
classification or regression problems. 


Max Voting (Majority Voting) - In the context of ensemble methods, this is a method 
used for classification problems, where multiple models are used to make predictions for 
each data point. The predictions by each model are considered as a 'vote'. Predictions 
gotten from the majority of models are used as the final predictions i.e. When 4 out 5 
models gave a classification of 1 and the other 0, then the majority voting leads to the 
prediction outcome being 1. We will get to use the VotingClassifier() module in sklearn 
to learn how to max voting happens. 


Averaging - In the context of regression, multiple predictions are made for each data 
point in averaging. An average of predictions from all the models is used to make the 
final prediction in regression problems, or while calculating probabilities for classification 
problems. 


Weighted Average - The weighted average is an extension of the average method, 
where all models are assigned different weights defining the importance of each model 
for prediction i.e. Two models out of five can be given more weights/importance as 
compared to other models. 


Overfitting - This is a term used to refer to a modeling error that introduces a bias to the 
model because it corresponds too closely to a particular set of data. It makes the model 
relevant to its dataset only and is irrelevant to any other dataset. Methods used to 
prevent overfitting include the use of ensemble techniques, cross-validation, etc. 
Underfitting is the vice-versa of overfitting. 


Variance - This is a measure of how the data distributes itself about the mean or 
expected value. 


Learners - Learners used in this session refer to models. 


Overview 


Ensemble methods/techniques are used with the goal of improving the accuracy of 
results in models. They make use of multiple models instead of using single models 
which increases the accuracy significantly. Ensemble techniques can be used for 
regression and classification, where they reduce bias and variance to boost accuracy. 


Applications 

Ensemble methods are useful in fields such as healthcare, where the smallest amount of 
improvement in accuracy can be valuable. They can also be used in banking where the 
smallest amount of accuracy can save funds. 


Reading 


Categories of Ensemble Techniques 
There are two broad categories of Ensemble Techniques 


a) Sequential Ensemble Techniques 
o These types of techniques are base learners in a sequence. The sequence 
generation of base learners promotes the dependence between the base 
learners. The performance of the model is improved by assigning higher weights 
to previously misrepresented learners. 
b) Parallel Ensemble Techniques 
a) These types of ensemble learners are base learners generated in a parallel 
format. They use the parallel generation of base learners to encourage 
independence between the base learners. The independence of base learners 
significantly reduces errors due to the applications of averages. 


Main Types of Ensemble Methods 


1. Bagging 
Bagging (also known as Bootstrap aggregation), is a type of ensemble method that 


increases the accuracy of models by averaging multiple estimates together, from which 
learners are trained on using random selections of features. 


The approach creates subsets from the training dataset. Upon performing the sampling 
method (also termed as bootstrapping), model predictions are combined and an average 
is used for regression, or a voting approach is used for classification. This would, in turn, 
decrease the overall variance, contributing to better model performance. 


An example of a bagging classification method is the Random Forests Classifier, in 
which individual decision trees classifiers are trained on different samples of the training 
dataset. 

As a result, there is a reduction of variance which increases accuracy, hence eliminating 
overfitting, which is a challenge to many predictive models. 


To break down further these bagging components i.e. aggregation and bootstrapping, 
let's go through a deeper explanation. 


a) Bootstrapping 

This is a sampling technique where samples are derived from the whole 
population using the replacement procedure. This sample, after replacement, is 
referred to as a resampling. This allows the model or algorithm to get a better 
understanding of the various biases, variances, and features that exist in the 
resample. Taking a sample of the data allows the resample to contain different 
characteristics which the sample might have contained. This would, in turn, affect 
the overall mean, standard deviation, and other descriptive metrics of a data set. 
Ultimately, leading to the development of more robust models. 


This bootstrapping method can also test the stability of a solution. By using 
multiple sample data sets and then testing multiple models, it can increase 
robustness. 


b) Aggregation 

Aggregation is done to incorporate all possible outcomes of the prediction and 
randomize the outcome. Predictions from the models are aggregated to make a 
final combined prediction. This aggregation is done on the basis of predictions 
made or the probability of the predictions made by the bootstrapped individual 
models. 


The following image depicts the above process of bootstrapping and aggregation: 
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To read more about bagging, we can refer to this reading. [Link] 






















Source: [Link] 


Benefits of bagging are: 
e Weak learners are combined to form a single strong learner that is more accurate 
than single learners. 
e By reducing variance, it reduces the overfitting of models. When the challenge in 
a single model is overfitting, the bagging performs better than boosting 
ensembles which we will cover in a short while. 


Trade-Offs 
e Bagging is computationally expensive and may not be desirable depending on 
the use case. 


Bagging Algorithms 


a. Bagging meta-estimator 


This is an ensemble technique that follows the typical bagging technique 
to make predictions. It can be used for both classification 
(BaggingClassifier) and regression (BaggingRegressor) problems. 


Below are the steps for the bagging meta-estimator algorithm: 
1. Random subsets are created from the original dataset 
(Bootstrapping). 
2. The subset of the dataset includes all features. 
3. A user-specified base estimator is fitted on each of these smaller 
sets. 
4. Predictions from each model are combined to get the final result. 


b. Random Forest 


This type of ensemble technique is an extension of the bagging estimator 
algorithm. It is a combination of decision trees that can be modeled for 
prediction and behavior analysis. 


It trains several decision trees in parallel with bootstrapping followed by 
aggregation. It makes a prediction by averaging the predictions of a 
component tree or in classification, taking one with the majority of votes 
as the selected prediction. 


It generally has much better predictive accuracy than a single decision 
tree and works well with default parameters. 


Below is an image that describes how random forest works: 
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Source [Link] 


Below are the steps for the random forest algorithm: 
1. Random subsets are created from the original dataset 
(bootstrapping). 
2. Ateach node in the decision tree, only a random set of features is 
considered to decide the best split. 
A decision tree model is fitted on each of the subsets. 
4. The final prediction is calculated by averaging the predictions from 
all decision trees. 


a 


Random forests have a variety of applications, such as recommendation 
engines, image classification, and feature selection. 


2. Boosting 


Boosting ensemble techniques, which are sequential methods, learn from previous 
predictor mistakes to make better predictions in the future. These types of techniques 
combine several weak base learners to form one strong learner and as a result 
significantly improve the predictability of models. 


Boosting works by arranging weak learners in a sequence, such that weak learners learn 
from the next learner in the sequence to create better predictive models. The predictions 
of the classifiers are aggregated and then the final predictions are made through a 
weighted sum (for regression) or a weighted majority vote (for classification). 


Boosting takes many forms i.e. gradient boosting, adaptive boosting (AdaBoost), and 
extreme gradient boosting (XGBoost). 

e Gradient boosting adds predictors sequentially to the ensemble, where preceding 
predictors correct their successors. New predictors are fit to counter the effects of 
errors in previous predictors. 

e XGBoost uses decision trees with gradient boosting. 


The following steps define how boosting works: 

A subset is created from the original dataset. 

Initially, all data points are given equal weights. 

A base model is created on this subset. 

This model is used to make predictions on the whole dataset. 

Errors are calculated using the actual values and predicted values. 

The observations which are incorrectly predicted, are given higher weights. 

Another model is created and predictions are made on the dataset. This model 

tries to correct the errors from the previous model. 

8. Similarly, multiple models are created, each correcting the errors of the previous 
model. 

9. The final model (strong learner) is the weighted mean of all the models (weak 
learners). 


SO Soe 


The boosting algorithm combines a number of weak learners to form a strong learner. 
The individual models would not perform well on the entire dataset, but they work well for 
some part of the dataset. Thus, each model actually boosts the performance of the 
ensemble. 


Boosting algorithms: 


a. AdaBoost 


In Adaptive boosting, multiple sequential models (decision trees) are created, 
each correcting the errors from the last model. Adaboost assigns weights to the 
observations which are incorrectly predicted and the subsequent works to predict 
these values correctly. 


The steps for performing AdaBoost modeling are as follows: 

Initially, all observations in the dataset are given equal weights. 

A model is built on a subset of data. 

Using this model, predictions are made on the whole dataset. 

Errors are calculated by comparing the predictions and actual values. 

While creating the next model, higher weights are given to the data points 

which were predicted incorrectly. 

6. Weights can be determined using the error value. 

7. This process is repeated until the error function does not change, or the 
maximum limit of the number of estimators is reached. 
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b. Gradient Boosting Machines (GBM) 


GBM works by using a number of weak learners to form a strong learner. 
Regression trees are used as a base learner, each subsequent tree in series is 
built on the errors calculated by the previous tree. 


c. XGBoost 


XGBoost is a decision-tree-based ensemble technique that uses gradient 
boosting. It has high predictive power and is almost 10 times faster than the other 
gradient boosting techniques. It includes a variety of regularization which reduces 
overfitting and improves overall performance. Hence it is also known as the 
regularized boosting technique. 


More specifically, XGBoost improves upon the base of gradient through systems 
optimization and algorithmic enhancements. System optimization involves 
parallelization, tree pruning, and hardware optimization while algorithmic 
enhancements include regularisation, sparsity awareness, weighted quantile 
sketch, and cross-validation. You care to read in-depth about these techniques 
by going through this [link]. 


d. Light GBM 


Light GBM works as a gradient boosting ensemble technique that works well 
when the dataset is extremely large and it takes less time. It is also based on 


tree-based algorithms and follows a leaf-wise approach as opposed to other 
algorithms which work in a level-wise approach. 

In certain cases, leaf-wise growth may cause overfitting on smaller datasets but 
that can be avoided by using the ‘max_depth’ parameter for learning. 


Below is an image that describes these two processes. 
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Level-wise tree growth 


Source [Link] 
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Leaf-wise tree growth 


Source [Link] 


. CatBoost 


Catboost is another type of gradient boosting ensemble technique that is able to 
deal with categorical variables with too many labels (i.e. they are highly cardinal) 
and does not require extensive data preprocessing like other machine learning 
algorithms. 


Performing one-hot-encoding on such categorical variables exponentially 
increases the dimensionality and it becomes really difficult to work with the 
dataset. CatBoost allows us to not perform one-hot encoding for categorical 
variables. 
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